merge: dev → master (admin_users scope migration v1.5 + authorship_review table + reporter ETL + abstract fixes)#94
Open
paulalbert1 wants to merge 32 commits into
Open
merge: dev → master (admin_users scope migration v1.5 + authorship_review table + reporter ETL + abstract fixes)#94paulalbert1 wants to merge 32 commits into
paulalbert1 wants to merge 32 commits into
Conversation
The analysis_nih table was loaded with ~527K rows (2x duplicates) on Dec 18, 2025. Since then, every nightly run retrieves the correct ~267K rows but validation rejects the swap (267K/527K = 50.7% < 80% threshold). Changes: - validate_data() now detects duplicate pmids in production and uses the unique count for percentage comparison, allowing the swap to self-heal - create_staging_tables() adds a UNIQUE constraint on pmid for analysis_nih_new to prevent future duplicate inserts - Added check_production_integrity() utility for diagnostic use - Schema: changed analysis_nih.idx_pmid from KEY to UNIQUE KEY Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update README, cleanup .gitignore, remove unused artifacts
…ass-dev Fix: NIH validation blocking table swap for 77+ days
…standalone SP STEP 4: Replace person.title with person_person_type LEFT JOIN chain (COALESCE) for facultyRank derivation. Replace article counts that counted all articles with filtered counts (Research Article + percentileNIH IS NOT NULL only). STEP 5: Replace all 8 combined UPDATE statements (using wrong threshold-percentage formula) with 24 separate UPDATE statements matching the canonical createEventsProceduresReciterDb.sql v2: - Percentile: AVG of top N articles ranked by percentileNIH DESC - Denominator: Count of peers at same facultyRank meeting thresholds - Rank: RANK() OVER (PARTITION BY facultyRank ORDER BY percentile DESC) All 8 metrics (top5/10 x All/First/Senior/FirstSenior) now match the canonical logic. Steps 1-3, 6, 7 verified identical and unchanged. Ref: createEventsProceduresReciterDb.sql lines 3720-4252
Capture new top-level and feedbackEvidence fields emitted by ReCiter: - datePublicationAddedToPMC (person_article, analysis_summary_article) - feedbackScoreTextSimilarity (person_article) - feedbackScoreJournalTitleSimilarity (person_article) - feedbackScoreBibliographicCoupling (person_article) Schema, nightly SP (v2 + inline copy), legacy loose-SQL insert, CSV transformer, and LOAD DATA column list updated in lockstep. Adds idempotent migration setup/alter_add_feature_generator_fields_v1.1.sql for applying to prod and dev DBs out-of-band. ALTER must run BEFORE deploying the updated ETL image, otherwise LOAD DATA fails on unknown columns.
Applied to prod + dev on MariaDB 11.4 / 10.6 in ~25s (network-bound). AFTER placement forces INPLACE algorithm with metadata lock; appending at end allows INSTANT (no table rewrite, no lock hold).
The v2 populateAnalysisSummaryTables SP was selecting articleYear directly from person_article without a fallback, leaving 74% of analysis_summary_article rows with articleYear = 0. The legacy SP had a post-INSERT UPDATE step that derived the year from publicationDateStandardized in those cases, which the v2 rewrite dropped. Push the fallback into the SELECT itself: IF(articleYear != 0, articleYear, LEFT(publicationDateStandardized, 4)) Verified post-deployment: 0 rows with articleYear = 0 in prod and dev.
The iCite API returns two arrays per record: - cited_by: PMIDs that cite the queried article - references: PMIDs the queried article cites The CSV write for cited_by entries used [cited_by_elem, queried_pmid], but the LOAD DATA columns are (cited_pmid, citing_pmid). Since cited_by elements are the CITING articles (not the cited), this stored every cited_by edge with the cited/citing columns swapped. The references loop was already correct. Same inversion affected analysis_nih_cites_clin. Effect: queries like SELECT COUNT(*) FROM analysis_nih_cites WHERE cited_pmid = <pmid> returned only the rows sourced from other queried PMIDs' references arrays, missing the (typically much larger) cited_by set. Example: PMID 32432483 has iCite citation_count=192 but only 19 rows surfaced because just 19 WCM-tracked papers happened to reference it. After this change, on the next nightly run the table swap will load analysis_nih_cites with semantically correct (cited_pmid, citing_pmid) columns, matching the join used by downstream consumers (Scholars-Profile-System/lib/api/publication-detail.ts).
fix(retrieveNIH): correct column inversion in analysis_nih_cites load
abstractImport.py wrote abstracts to a CSV with csv.writer, then bulk-loaded
it with LOAD DATA LOCAL INFILE. The two are incompatible: csv.writer emits
\r\n line endings and doubled-quote escaping (""), while the LOAD used
LINES TERMINATED BY '\n' and ENCLOSED BY '"' with backslash escaping. Any
abstract containing a double quote desynced MySQL's row parser, so ~99.9% of
every file was dropped with no error raised (~43 of ~72,000 rows loaded).
The unresolved PMIDs were re-selected every cycle and the unbounded
while-True loop ran until the 15,000s pipeline timeout, restart-looping
indefinitely and blocking the nightly CronJob.
- Replace CSV + LOAD DATA with a parameterized, batched executemany INSERT
so abstract text is stored verbatim regardless of content.
- Bound the fetch/insert loop: stop on no progress, cap at 25 cycles, so
unresolvable PMIDs can no longer hang the pipeline.
- Retry batch_get_item UnprocessedKeys with backoff instead of dropping
throttled keys.
- Connect with charset=utf8mb4; drop the now-unused local_infile flag.
- Add a --dry-run mode that verifies the fetch/insert path against a
TEMPORARY table without touching reporting_abstracts.
fix(abstractImport): replace CSV bulk-load that silently dropped rows
Adds a new nightly ETL step (retrieveReporter.py) that pulls grant metadata
and pub-grant linkages from NIH RePORTER (api.reporter.nih.gov/v2) and
reconciles them against the existing PubMed-derived person_article_grant
table.
Three new tables (alter_add_reporter_fields_v1.2.sql):
- grant_reporter_project — RePORTER /projects/search results, refreshed
each cycle (truncate-reload). Includes abstract_text for cross-reference.
- grant_reporter_link — RePORTER /publications/search (pmid, appl_id) pairs.
- grant_provenance — long-lived per-(person, pmid, grant) audit log with
source_reporter, source_reciterdb, *_first_seen, last_verified. Survives
the nightly truncate-reload of person_article_grant. Keyed by
(personIdentifier, pmid, core_project_num) where core_project_num is the
normalized NIH grant identifier (e.g. R01DK127777).
ETL strategy:
- Projects: filtered by org_names=["WEILL MEDICAL COLL OF CORNELL UNIV"],
partitioned by fiscal year to stay under the 9,999 offset cap (WCM has
~15K projects historically). 1 req/sec rate limit honored.
- Publications: keyed by appl_ids from projects, batched.
- Reconciliation: bulk INSERT...SELECT ON DUPLICATE KEY UPDATE on both
sides. Reciterdb side stages normalized grant strings via temp table +
LOAD DATA LOCAL INFILE; RePORTER side joins grant_reporter_link to
person_article (userAssertion='ACCEPTED') as the false-positive guard.
- Subaward caveat: WCM-as-subaward will be missed by the org filter
(false-negative tradeoff accepted to keep false positives near zero).
Validated on dev: 235K provenance rows from 33K reciterdb + 132K RePORTER
inputs. End-to-end ~18 min (15 min publications fetch is the rate-limited
floor; reconciliation completes in ~10 sec via bulk SQL).
RePORTER /projects/search returns NIH-curated keyword vocabularies alongside the abstract. Capture them into grant_reporter_project so the Scholars-Profile-System funding ETL can project them onto grants as a topical search signal (issue #291). - alter_add_reporter_terms_v1.3.sql: ADD COLUMN project_terms, pref_terms via the information_schema-guarded idiom, safe on the live table. - v1.2 CREATE TABLE: mirror the two columns for fresh builds. - retrieveReporter.py: pull `terms` and `pref_terms` from each project dict, stored raw (terms angle-bracket-wrapped, pref_terms semicolon-delimited).
feat(reporter): NIH RePORTER ETL + project-term capture (#291) [dev]
…t reload fetch_projects() partitions the search by fiscal year when the corpus exceeds RePORTER's 9,999 offset cap (WCM has ~15K projects). RePORTER returns a multi-year project under every fiscal year it was active, so the same appl_id comes back in multiple FY pages. appl_id is grant_reporter_project's PRIMARY KEY, so the unguarded reload hit "IntegrityError 1062: Duplicate entry for key 'PRIMARY'" and the run aborted after TRUNCATE had already emptied the table. Dedup the projects loop by appl_id, mirroring the seen_pairs guard the publications loop already uses.
fix(reporter): dedup projects by appl_id before grant_reporter_project reload [dev]
run_all.py invokes retrieveReporter.py, but the Dockerfile's per-file COPY list was never updated when the RePORTER ETL step was added (PR #81), so the built image lacked the script. Every nightly run crashed at step 4 with "can't open file '/usr/src/app/retrieveReporter.py'", halting the pipeline before nightly indexing, abstractImport, and conflictsImport. With restartPolicy: OnFailure this produced an indefinite ~95-min restart loop.
fix(docker): add missing COPY for retrieveReporter.py
…SV/LOAD DATA path The pre-PR-#78 abstractImport.py wrote rows with csv.writer (RFC-4180 doubled-quote escaping) and LOAD DATA LOCAL INFILE (backslash escaping). On rows whose abstract text contained ", MySQL's parser desynced and consumed multiple TSV rows into one field until the next recoverable delimiter, producing a single reporting_abstracts row attributed to one pmid but containing concatenated text from several other papers in the same DynamoDB batch. The May 16 fix (PR #78) stopped new corruption but did not repair existing rows. fetch_missing_pmids() uses LEFT JOIN ... WHERE a.pmid IS NULL, so any pmid with a (corrupted) row was permanently skipped. Audit against prod found 3,124 corrupted rows out of 391,238 total. 84% are pegged at the 65,535-byte BLOB cap with the original pmid's abstract at the head and text from several unrelated papers at the tail. 181 additional rows are clean-pair duplicates from the same era (no UNIQUE constraint on pmid; concurrent old-import runs produced identical content twice). Verification against DynamoDB also spared 464 legitimately long abstracts (structured / consensus / multi-arm trials, lengths up to ~32K) that a length-only filter would have wrongly purged. This change ships the tooling and schema for the cleanup: - update/auditAbstracts.py -- read-only forensic audit. Compares reporting_abstracts.abstract against DynamoDB.PubMedArticle (the same source abstractImport.py uses) and classifies each row >= 4000 chars as CLEAN, PREFIX_CORRUPTED, DISJOINT, EMPTY_IN_DYNAMO, or MISSING_IN_DYNAMO. Writes audit_abstracts.csv plus a dump of the worst offenders. - update/repairAbstracts.py -- destructive cleanup. Backs up affected rows to a timestamped table, deletes corrupted rows in batches, then dedupes any remaining pmid duplicates by keeping MIN(id). Requires --apply; default is dry-run. Confirms the v1.4 migration precondition (no duplicates) after running. - setup/alter_add_uq_pmid_reporting_abstracts_v1.4.sql -- adds UNIQUE KEY on reporting_abstracts.pmid. information_schema-guarded; aborts if any duplicates remain. Mirrors the analysis_nih fix from PR #71 after the Dec 2025 duplicate-loading incident. - setup/createDatabaseTableReciterDb.sql -- idx_pmid is now UNIQUE for fresh installs. - .gitignore -- excludes audit / repair artifacts at the repo root (they contain prod abstract text). DBA runbook after merge: 1. python3 update/auditAbstracts.py 2. python3 update/repairAbstracts.py (dry-run; review counts) 3. python3 update/repairAbstracts.py --apply 4. mysql ... < setup/alter_add_uq_pmid_reporting_abstracts_v1.4.sql 5. Next nightly abstractImport.py backfills the deleted PMIDs via the parameterized path. Refs: #87, PR #78.
…y halts Two bugs discovered when running the runbook end to end against prod: 1. update/repairAbstracts.py used `CREATE TABLE LIKE` + `INSERT INTO ... SELECT *`. Prod added a STORED generated column `abstract_len` (with a composite index `idx_abs_pmid_len`) directly without updating the repo's schema file. The SELECT * pulled the generated-column value; MariaDB's strict mode rejected it with error 1906 and the backup aborted with the destination table empty. The fix queries information_schema for non-generated columns and enumerates them in both the corrupted-rows INSERT and the dedup INSERT. The script now works whether or not abstract_len exists. 2. setup/alter_add_uq_pmid_reporting_abstracts_v1.4.sql's precondition guard used `SELECT 'aborted...' AS error, 1/0 AS force_error`. The 1/0 only emits a warning in MariaDB's default SQL mode -- it does NOT halt the script. So when the DBA uploaded the migration before running the repair, the precondition SELECT was printed but execution continued to the ALTER, which then failed with `Duplicate entry '9182809' for key 'idx_pmid'`. The fix synthesizes a SELECT against a non-existent table whose name encodes the duplicate count (`__migration_aborted_reporting_abstracts_has_N_duplicate_pmids__ _run_update_repairAbstracts_py_first`). The resulting "Table doesn't exist" error halts the SQL client immediately, and the error text itself tells the operator what to do. Schema drift (the live abstract_len + idx_abs_pmid_len that aren't in createDatabaseTableReciterDb.sql) is a separate concern not addressed here.
fix(reporting_abstracts): repair cross-paper concatenation from old CSV/LOAD DATA path (#87)
Durable curator-state table for the Publication Manager 'Authorships' review queue (Curator_All). Survives the nightly truncate-reload like grant_provenance — it is in no truncate list (updateReciterDB.py all_tables) and touched by no stored procedure; populated externally by the adversarial-attribution-review producer. CREATE TABLE IF NOT EXISTS, additive, no change to existing ETL.
Add durable authorship_review table (PM Authorships queue)
Adds scope_person_types, scope_org_units, proxy_person_ids (JSON NULL) to admin_users via an idempotent information_schema-guarded ALTER, for existing databases that predate the fresh-build schema (#92, master). The Publication Manager dev-branch AdminUser model (commit 579d32f) selects these columns during login; without them findOrcreateAdminUser fails with 'Unknown column' and authentication returns 401 for every user. Run before deploying the PM dev branch against an existing reciterdb (e.g. production). Additive only; admin_users is durable (not in updateReciterDB.py all_tables truncate list). No-op on re-run.
feat(setup): add admin_users scope/proxy column migration (v1.5)
…ble (#95) Add a nightly ETL step that scans the ReCiter DynamoDB ArticleProvenance table and loads first-retrieval provenance into a new reciterdb table, article_provenance, so Publication Manager can display the date a publication was first retrieved (PM #737). - update/retrieveArticleProvenance.py: streams the scan into an article_provenance_new staging table (INSERT IGNORE per page to bound memory and collapse duplicates), validates row counts against production, then atomic RENAME-swaps. Converts frd (epoch seconds, UTC) to DATETIME. Mirrors the retrieveNIH staging->swap pattern. - setup/createDatabaseTableReciterDb.sql: article_provenance DDL for fresh installs, keyed (pmid, personIdentifier). - setup/alter_add_article_provenance_v1.6.sql: migration for existing dev/prod databases. - update/run_all.py: run the step before nightly indexing as non-fatal so a failure cannot block the indexing SP. - Dockerfile: COPY the new script. - README: document the step and backfill the missing retrieveReporter row. The table is keyed on (pmid, personIdentifier): the source DynamoDB table has a composite key uid (HASH) + articleId (RANGE), one item per person+article, so frd is per-person -- not global-per-article as issue #95 assumed.
feat(etl): scan ArticleProvenance (DynamoDB) -> article_provenance table (#95)
Revert #96: drop article_provenance batch ETL in favor of on-the-fly DynamoDB reads
3-places mirror of ReCiter-Publication-Manager schema: - add admin_permissions, admin_role_permissions, admin_permission_resources to createDatabaseTableReciterDb.sql - add impersonatedByUserID column to admin_feedback_log - add table_admin_permissions.sql with the permission/role-mapping/nav seed Mirrors PM scripts/migrations/add-permission-tables.sql and add-impersonated-by-feedbacklog.sql. Refs #739, #733.
feat(setup): mirror RBAC permission tables + impersonatedByUserID column (#739, #733)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Propagates
devtomaster(prod). You control the merge — opened for review, not auto-merged.Key contents (dev not in master)
setup/alter_add_admin_user_scope_columns_v1.5.sql— idempotent ALTER addingscope_person_types/scope_org_units/proxy_person_ids(JSON NULL) toadmin_users. PM-deploy blocker (PM dev SELECTs these on every login). Reviewed APPROVE.setup/table_authorship_review.sql— durable authorship_review table (authorship-review initiative).The pipeline does not run
setup/*.sqlDDL against the DB. The v1.5 ALTER must be applied directly to each existing DB (the prod reciterdb is being updated separately as part of this work). Merging only updates the repo/fresh-build path.Divergence
masteris 35 commits ahead ofdev(bidirectional divergence) — this is a real integration, reconcile on merge. v1.5 ALTER confirmed not present on master.